
9 research outputs found

Recent Progress in Development of Language Model for Slovak Large Vocabulary Continuous Speech Recognition

Author: Daniel Hládek
Jozef Juhár
Ján Staš
Publication venue: IntechOpen
Publication date: 30/03/2012

Categorization of unorganized text corpora for better domain-specific language modeling

Author: Daniel Hládek
Jozef Juhár
Ján Staš
Daniel Zlacký
Publication venue: Vysoká škola báňská - Technická univerzita Ostrava
Publication date: 01/01/2013

This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented by a vector space model with term weighting based on term frequency and inverse document frequency. Text documents are then automatically classified as in-domain or out-of-domain data by comparing them with the list of key phrases, using one of the selected distance/similarity measures and a predefined threshold. Experimental results of language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19% and a decrease in the word error rate of the Slovak transcription and dictation system of about 5.54%, in relative terms.
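The categorization scheme in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the smoothed IDF formula, the choice of cosine similarity, the threshold value of 0.2, the whitespace tokenizer, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus (assumed formula)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a {term: weight} dict; unknown terms get weight 0."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def categorize(texts, key_terms, threshold=0.2):
    """Label each text in-domain or out-of-domain by similarity to the key terms."""
    docs = [text.lower().split() for text in texts]  # naive tokenizer (assumption)
    idf = idf_table(docs)
    key_vec = tfidf(key_terms, idf)
    labels = []
    for tokens in docs:
        sim = cosine(tfidf(tokens, idf), key_vec)
        labels.append("in-domain" if sim >= threshold else "out-of-domain")
    return labels


texts = [
    "the court issued a verdict in the trial",
    "the team won the football match",
    "judge and court heard the appeal",
]
key_terms = ["court", "judge", "verdict", "trial", "appeal"]
print(categorize(texts, key_terms))
```

In this toy example the key terms stand in for the paper's list of most frequent key phrases; a real system would likely match multi-word phrases and tune the threshold on held-out data.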

Directory of Open Access Journals

DSpace at VSB Technical University of Ostrava

Morphological analysis of the Slovak language

Author: Daniel Hládek
Jozef Juhár
Ján Staš
Publication venue: VSB Technical University of Ostrava, Faculty of Electrical Engineering and Computer Sciences
Publication date: 01/01/2015

This paper proposes a new statistical method for segmenting words by identifying the suffix. The ability to identify a suffix can improve morphological analysis by allowing the classifier to assign tags to words not seen in the training corpus. The identified suffix can be used to improve the accuracy of part-of-speech tagging or other natural language processing tasks.

Directory of Open Access Journals

DSpace at VSB Technical University of Ostrava

Analysis of morph-based language modeling and speech recognition in Slovak

Author: Daniel Hládek
Jozef Juhár
Ján Staš
Daniel Zlacký
Publication venue: Vysoká škola báňská - Technická univerzita Ostrava
Publication date: 01/01/2012

The inflection of the Slovak language produces a large number of unique word forms, which leads not only to a large vocabulary but also to many out-of-vocabulary words. Morph-based language models address this problem by decomposing inflected word forms into small sub-word units, which also mitigates the general problem of training data sparsity. In this paper, we present several rule-based and data-driven approaches to the automatic segmentation of words into morphs. The segmented data are then used to model the Slovak language for large vocabulary continuous speech recognition. Preliminary results show a significant decrease in the number of out-of-vocabulary words and a reduction in the perplexity of the resulting language model.

Directory of Open Access Journals

DSpace at VSB Technical University of Ostrava

Survey of Automatic Spelling Correction

Author: Daniel Hládek
Ján Staš
Matúš Pleva
Publication venue: MDPI AG
Publication date: 13/10/2020

Automatic spelling correction has been receiving sustained research attention. Although each article contains a brief introduction to the topic, there is a lack of work that summarizes the theoretical framework and provides an overview of the approaches developed so far. Our survey selected papers about spelling correction indexed in Scopus and Web of Science from 1991 to 2019. The surveyed systems fall into three groups: the first uses a set of rules designed in advance, the second uses an additional model of context, and the third can adapt its model to the given problem. The summary tables show the application area, language, string metrics, and context model for each system. The survey describes selected approaches in a common theoretical framework based on Shannon’s noisy channel. A separate section describes evaluation methods and benchmarks.
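For orientation, the noisy-channel view of spelling correction is commonly written as follows. The symbols are assumptions chosen here, not necessarily the survey's notation: $s$ is the observed (possibly misspelled) string, $w$ ranges over candidate corrections, $P(s \mid w)$ is the error (channel) model, and $P(w)$ is the language (context) model.

```latex
\hat{w} = \arg\max_{w} P(w \mid s)
        = \arg\max_{w} \frac{P(s \mid w)\, P(w)}{P(s)}
        = \arg\max_{w} P(s \mid w)\, P(w)
```

The denominator $P(s)$ is dropped in the last step because it does not depend on the candidate $w$.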

Multidisciplinary Digital Publishing Institute

Unsupervised spelling correction for Slovak

Author: Daniel Hládek
Jozef Juhár
Ján Staš
Publication venue: Vysoká škola báňská - Technická univerzita Ostrava
Publication date: 01/01/2013

This paper introduces a method for automatically proposing and choosing a correction for an incorrectly written word in a large text corpus written in Slovak. The task can be described as finding the sequence of correct words that best matches the list of incorrectly spelled words found in the input. The knowledge base of the classification system, consisting of statistics about sequences of correctly typed words and possible corrections for incorrectly typed words, can be described mathematically as a hidden Markov model. The best matching sequence of correct words is found using the Viterbi algorithm. The system will be evaluated on a manually corrected test set.
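The hidden Markov model formulation in this abstract can be sketched in Python as below. This is a hedged illustration, not the paper's system: the function name `viterbi_correct`, the dictionary-based probability tables, the `<s>` start symbol, the probability floor for unseen events, and the English toy data are all assumptions. Hidden states are candidate correct words, observations are the typed words, transitions are bigram probabilities, and emissions are the probabilities of a typed form given the intended word.

```python
import math


def viterbi_correct(observed, candidates, emission, bigram, floor=1e-6):
    """Return the most probable sequence of correct words for the typed words.

    observed:   list of typed (possibly misspelled) words
    candidates: dict typed word -> list of candidate corrections
    emission:   dict (typed, candidate) -> P(typed | candidate)
    bigram:     dict (previous, current) -> P(current | previous)
    floor:      probability assigned to unseen events (assumption)
    """
    if not observed:
        return []

    def cands(word):
        # A word with no listed candidates is kept unchanged.
        return candidates.get(word, [word])

    # best[w] = (log-probability of the best path ending in w, that path)
    best = {}
    first = observed[0]
    for w in cands(first):
        score = math.log(bigram.get(("<s>", w), floor))
        score += math.log(emission.get((first, w), floor))
        best[w] = (score, [w])

    for typed in observed[1:]:
        new_best = {}
        for w in cands(typed):
            emit = math.log(emission.get((typed, w), floor))
            # Pick the predecessor state that maximizes the path score.
            score, path = max(
                ((s + math.log(bigram.get((prev, w), floor)), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[w] = (score + emit, path + [w])
        best = new_best

    return max(best.values(), key=lambda item: item[0])[1]


# Toy data (made up for illustration only).
candidates = {"teh": ["the", "ten"]}
emission = {("teh", "the"): 0.4, ("teh", "ten"): 0.6,
            ("cat", "cat"): 0.9, ("sat", "sat"): 0.9}
bigram = {("<s>", "the"): 0.5, ("<s>", "ten"): 0.1,
          ("the", "cat"): 0.3, ("ten", "cat"): 0.01,
          ("cat", "sat"): 0.2}
print(viterbi_correct(["teh", "cat", "sat"], candidates, emission, bigram))
```

In the toy data the emission table alone slightly prefers "ten" for the typed form "teh", but the bigram context makes "the" the better choice, which is the behavior the sequence model is meant to capture.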

Directory of Open Access Journals

DSpace at VSB Technical University of Ostrava

Classification of heterogeneous text data for robust domain-specific language modeling

Author: A Huang
A Lee
A Singhal
A Stolcke
CD Manning
D Hládek
D Zlacký
Daniel Hládek
F Peng
J Juhár
Jozef Juhár
JS Whissell
JW Reed
Ján Staš
L Yue
M Pleva
N Remeikis
PL Rosin
R Garabík
R Jin
S Darjaa
S Lee
S Tan
SE Robertson
SH Cha
T Joachims
W Zhang
Publication venue: Springer Science and Business Media LLC

Crossref

Learning string distance with smoothing for OCR spelling correction

Author: A Bellet
A Eutamene
D Hládek
Daniel Hládek
DW Oard
E Ristad
G Pengcheng
H Yang
J Staš
JH Park
Jozef Juhár
Ján Staš
K Kukich
KU Schulz
Lászlo Kovács
O Tange
P Kantor
RA Wagner
SB Needleman
SF Chen
Stanislav Ondáš
U Reffle
YY Lv
Publication venue: Springer Science and Business Media LLC

Crossref